Abstract

Background: Multi-objective optimization (MOO) addresses optimization problems with multiple objectives.
These objectives generally measure very different aspects of a solution, and these aspects are often
in conflict with each other. MOO first obtains a Pareto set and then looks for both commonality and systematic
variation across the set. For large-scale datasets, heuristic search algorithms such as evolutionary algorithms (EA)
combined with MOO techniques are well suited. DNA microarray technology can measure the transcriptional response
of a complete genome to different experimental conditions and yields many large-scale datasets. Biclustering
simultaneously clusters the rows and columns of a dataset and helps to extract more accurate information from those
datasets. Biclustering must optimize several conflicting objectives and can therefore be solved with MOO methods.
As a heuristics-based optimization approach, particle swarm optimization (PSO) simulates the movements of a bird
flock searching for food. The shuffled frog-leaping algorithm (SFL) is a population-based cooperative search metaphor
that combines the benefits of the local search of PSO with the global shuffling of information of the complex
evolution technique. SFL is used to solve optimization problems on large-scale datasets.

Results: This paper integrates a dynamic population strategy and the shuffled frog-leaping algorithm into the
biclustering of microarray data, and proposes a novel multi-objective dynamic population shuffled frog-leaping
biclustering (MODPSFLB) algorithm to mine maximum biclusters from microarray data. Experimental results show that
the proposed MODPSFLB algorithm can effectively find significant biological structures in terms of related
biological processes, components and molecular functions.

Conclusions: The proposed MODPSFLB algorithm achieves good diversity and fast convergence of Pareto solutions and
can serve as a powerful tool for systematic functional analysis in genome research.




Genomics 



Background 

With the rapid development of DNA microarray technology, simultaneously measuring the expression levels of
thousands of genes in a single experiment yields large-scale datasets. The analysis of microarray data mainly
comprises the study of gene expression under different environmental stress conditions and the comparison of gene
expression profiles for tumors from cancer patients. A subset of genes showing correlated co-expression patterns
across a subset of conditions is expected to be functionally related. By comparing gene expression in normal and
diseased cells, microarray datasets may be used to identify disease genes and targets for therapeutic drugs.
Therefore, mining patterns from microarray datasets is becoming more and more important. These patterns relate to
disease diagnosis, drug discovery, protein network analysis, gene regulation, and function prediction.

For microarray data analysis, clustering is a popular technique for mining significant biological models.
Clustering can identify sets of genes with similar profiles. However, traditional clustering approaches such as
k-means [1], self-organizing maps [2], support vector machines [3] and hierarchical clustering [4] assume that
related genes have similar expression patterns across all conditions, which is not reasonable, especially when
the dataset contains many heterogeneous conditions. In fact, relevant genes are not necessarily related under all
conditions. To cluster subsets of genes that have similar expression over only some conditions, biclustering [5,6]
was proposed; it simultaneously clusters a gene subset and the condition subset over which that gene subset
exhibits similar expression patterns. Examples include δ-biclustering [5], pClustering [7], the
statistical-algorithmic method for biclustering analysis (SAMBA) [8], spectral biclustering [9], Gibbs sampling
biclustering [10] and simulated annealing biclustering [11].

Over the last three decades, biologically inspired heuristic optimization has become a very popular research
topic. In order to escape from local minima, many evolutionary algorithms (EA) have been used to find globally
optimal solutions from gene expression data [12-14]. When a single objective is optimized, a single global optimum
can be sought. In real-world optimization problems, however, several objectives in conflict with each other must
be optimized together, and this requires different mathematical and algorithmic tools. A multi-objective
evolutionary algorithm (MOEA) generates a set of Pareto-optimal solutions [15] and is therefore suitable for
optimizing two or more conflicting objectives.

When mining biclusters from microarray data, we must simultaneously optimize several objectives in conflict with
each other, for example the size and the homogeneity of the clusters. MOEAs have been proposed to discover
globally optimal solutions efficiently in this setting. Among the many MOEAs proposed, relaxed forms of Pareto
dominance have become a popular mechanism to regulate the convergence of an MOEA, to encourage more exploration
and to provide more diversity. Among these mechanisms, ε-dominance has become increasingly popular [16] because of
its effectiveness and its sound theoretical foundation. ε-dominance controls the granularity of the obtained
approximation of the Pareto front, which accelerates convergence and guarantees a good distribution of solutions.
At present, several algorithms [17,18] adopt MOEAs to discover biclusters from microarray data.

Particle swarm optimization (PSO), proposed by Kennedy and Eberhart [19], is a heuristics-based optimization
approach that simulates the movements of a bird flock searching for food. Most earlier versions of the particle
swarm operate in continuous space, where trajectories are changes of position along some dimensions. Kennedy
and Eberhart [20] proposed a discrete binary version of PSO, where trajectories are defined as changes in the
probability that a coordinate will take on a value of zero or one.

The most attractive feature of PSO is that there are very few parameters to adjust. It has therefore been used
successfully for both continuous nonlinear and discrete binary single-objective optimization.



The rapid convergence and relative simplicity of PSO make it well suited to multi-objective optimization, where
it is known as multi-objective PSO (MOPSO). In recent years many MOPSO approaches [21,22] have been proposed.
The ε-dominance strategy [23,24] has been introduced into MOPSO to speed up convergence and attain a good
diversity of solutions [25]. Liu [26] incorporates ε-dominance strategies into MOPSO and proposes a novel MOPSO
biclustering framework to find one or more significant biclusters of maximum size from microarray data.

Most multi-objective optimizers use a fixed population size to find the non-dominated solutions that approximate
the Pareto front. Computational cost is the main way in which population size affects these population-based
meta-heuristic algorithms. Hence dynamically adjusting the population size must balance computational cost
against algorithm performance. Several methods with a dynamic population size have been proposed. Tan [27]
proposed an incrementing MOEA (IMOEA) that adaptively computes an appropriate population size according to the
trade-off surface discovered online and the desired population size corresponding to the distribution density.
Yen and Lu [28] proposed a dynamic population size MOEA (DMOEA) that includes a population-growing strategy
based on the converted fitness and a population-declining strategy that relies on three indicators: age, health
and crowdedness. Leong and Yen [29] introduced a dynamic population size and a fixed number of multiple swarms
into a multi-objective optimization algorithm, which improved its diversity and convergence. Based on a dynamic
population, Liu [30] proposed a novel dynamic multi-objective particle swarm optimization biclustering (DMOPSOB)
algorithm to effectively mine significant biclusters of high quality.

Eusuff [31,32] developed the shuffled frog-leaping algorithm (SFLA) to solve combinatorial optimization
problems. Owing to its effectiveness and suitability, SFLA has attracted much attention and has been applied to
many practical optimization problems [31-33]. The shuffled frog-leaping (SFL) optimization algorithm has been
successful in solving a wide range of real-valued optimization problems. Madani [34] proposes a discrete
shuffled particle optimization algorithm that performs better, in terms of both success rate and speed, than the
binary genetic algorithm (BGA) and the discrete particle swarm optimization (DPSO) algorithm.

To the best of our knowledge, there is no published work on biclustering microarray data with SFLA. In this
paper we therefore present an effective SFLA biclustering algorithm for mining maximum biclusters with an
allowable dissimilarity within the biclusters and with a greater row variance. Computational experiments and
comparisons show that the proposed algorithm outperforms three of the best-performing algorithms recently
proposed for solving the biclustering problem under the same biclustering criterion.

Methods 

Based on the shuffled frog-leaping algorithm, crowding distance and the ε-dominance strategy [16], this paper
incorporates a dynamic population strategy into the MOSFLB algorithm [35] and proposes a multi-objective dynamic
population shuffled frog-leaping biclustering (MODPSFLB) algorithm to mine one or more significant biclusters of
maximum size from a microarray dataset. In the proposed algorithm, feasible solutions are regarded as frogs, and
Pareto-optimal solutions are preserved in the frog population, which is updated by the ε-dominance relation and
the computation of crowding distance. The next generation of the frog population is then dynamically adjusted
according to the dynamic population strategy [29]. The proposed method can effectively obtain more Pareto-optimal
solutions that are uniformly distributed along the Pareto front. The algorithm uses three objectives, the size,
homogeneity and row variance of biclusters, as the three fitness functions of the biclustering optimization
process. A low mean squared residue (MSR) score of a bicluster indicates that the expression levels of the genes
within the bicluster are similar over the range of conditions. Therefore, the goal of the algorithm is to find
more maximum biclusters with a mean squared residue lower than a given δ and with a relatively high row variance.

Biclusters 

Given a gene expression data matrix $D = G \times C = (d_{ij})$ (here $i \in [1, n]$, $j \in [1, m]$), a
real-valued $n \times m$ matrix, $G$ is a set of $n$ genes $\{g_1, g_2, \ldots, g_n\}$ and $C$ a set of $m$
biological conditions $\{c_1, c_2, \ldots, c_m\}$. Entry $d_{ij}$ is the expression level of gene $g_i$ under
condition $c_j$.

Definition 1 Bicluster. Given a gene expression dataset $D = G \times C$, if there is a submatrix
$B = g \times c$, where $g \subseteq G$ and $c \subseteq C$, that satisfies certain homogeneity and minimal size
criteria, we say that $B$ is a bicluster.

Definition 2 Maximal bicluster. A bicluster $B = g \times c$ is maximal if there is no other bicluster
$B' = g' \times c'$ such that $g \subseteq g'$ and $c \subseteq c'$, that is, $B$ is not contained in any
other bicluster.

Definition 3 Dimension mean. Given a bicluster $B = g \times c$, with a subset of genes $g \subseteq G$ and a
subset of conditions $c \subseteq C$, $d_{ij}$ is the value of gene $g_i$ under condition $c_j$ in the
dataset $D$. We denote by $d_{ic}$ the mean of the ith gene in $B$ and by $d_{gj}$ the mean of the jth
condition in $B$. We also denote by $d_{gc}$ the mean of all entries in $B$. These values are defined as
follows, where $Size(g, c) = |g||c|$ is the size of bicluster $B$.

$$d_{ic} = \frac{1}{|c|} \sum_{j \in c} d_{ij} \qquad (1)$$

$$d_{gj} = \frac{1}{|g|} \sum_{i \in g} d_{ij} \qquad (2)$$

$$d_{gc} = \frac{1}{|g||c|} \sum_{i \in g, j \in c} d_{ij} \qquad (3)$$

Definition 4 Residue and mean squared residue. Given a bicluster $B = g \times c$, to assess the difference
between the actual value of an element $d_{ij}$ and its expected value, we define the residue $r(d_{ij})$ of
$d_{ij}$ in bicluster $B$ by Eq.(4). The mean squared residue (MSR) of $B$ is then defined as the mean of the
squared residues and assesses the overall quality of bicluster $B$, as in Eq.(5).

$$r(d_{ij}) = d_{ij} - d_{ic} - d_{gj} + d_{gc} \qquad (4)$$

$$MSR(g, c) = \frac{1}{|g||c|} \sum_{i \in g, j \in c} r(d_{ij})^2 \qquad (5)$$

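Definition 5 Row variance. Given a bicluster $B = g \times c$, the variance of the ith gene in $B$ is denoted
by $RVAR(i, c)$, and the overall gene-dimensional variance $RVAR(g, c)$ is defined as the average of the
variances of all genes, as follows.

$$RVAR(g, c) = \frac{1}{|g||c|} \sum_{i \in g, j \in c} (d_{ij} - d_{ic})^2 \qquad (6)$$

$$RVAR(i, c) = \frac{1}{|c|} \sum_{j \in c} (d_{ij} - d_{ic})^2 \qquad (7)$$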

Our target is to mine good-quality biclusters of maximum size, with a mean squared residue (MSR) smaller
than a user-defined threshold δ > 0, which represents the maximum allowable dissimilarity within the
biclusters, and with a greater row variance. The problem is NP-complete, so the large majority of algorithms
use heuristic approaches to attain near-optimal solutions.
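
As a rough illustration of Definitions 3 to 5, the following Python sketch computes the mean squared residue and the row variance of a bicluster given as lists of gene and condition indices. It is not the authors' implementation: the function name `bicluster_scores` and the 0-based index lists are assumptions made for the example, and both scores are assumed to be normalized by |g||c|.

```python
def bicluster_scores(data, genes, conds):
    """Return (MSR, RVAR) for the bicluster data[genes][conds].

    data  : list of rows (genes), each a list of expression values
    genes : 0-based row indices in the bicluster (hypothetical convention)
    conds : 0-based column indices in the bicluster
    """
    sub = [[data[i][j] for j in conds] for i in genes]
    n_g, n_c = len(genes), len(conds)

    # Dimension means: gene means, condition means, overall mean
    row_mean = [sum(row) / n_c for row in sub]
    col_mean = [sum(sub[i][j] for i in range(n_g)) / n_g for j in range(n_c)]
    all_mean = sum(row_mean) / n_g

    # Mean squared residue, assumed normalized by |g||c|
    msr = sum(
        (sub[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in range(n_g) for j in range(n_c)
    ) / (n_g * n_c)

    # Row variance, assumed normalized by |g||c|
    rvar = sum(
        (sub[i][j] - row_mean[i]) ** 2
        for i in range(n_g) for j in range(n_c)
    ) / (n_g * n_c)
    return msr, rvar
```

For a purely additive submatrix (each row is another row plus a constant) every residue vanishes, so the MSR is zero while the row variance stays positive; this is the kind of coherent but non-flat pattern the two criteria are meant to favour.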

Bicluster encoding 

Each bicluster is encoded as an individual of the population. Each individual is represented by a binary
string of fixed length n+m, where n and m are the numbers of genes and conditions of the microarray dataset,
respectively. The first n bits correspond to the n genes and the following m bits to the m conditions. If a
bit is set to 1, the corresponding gene or condition belongs to the encoded bicluster; otherwise it does not.
This encoding has the advantage of a fixed size, so standard variation operators can be used. Figure 1
presents an individual encoding a bicluster with 2 genes and 3 conditions, whose size is 2 x 3 = 6.
Figure 1 An individual encoding a bicluster. Figure 1 presents the individual encoding a bicluster with 2 genes and 3 conditions, and its size
is 2 x 3 = 6.
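
The binary encoding can be illustrated with a short Python sketch. The helper names `encode` and `decode` and the 0-based indexing are assumptions made for this example only; the source specifies just the layout of n gene bits followed by m condition bits.

```python
def encode(genes, conds, n, m):
    """Build the (n + m)-bit string of a bicluster.

    genes, conds : 0-based indices of the selected genes and conditions
    n, m         : total numbers of genes and conditions in the dataset
    """
    bits = [0] * (n + m)
    for i in genes:
        bits[i] = 1          # first n bits: genes
    for j in conds:
        bits[n + j] = 1      # last m bits: conditions
    return bits


def decode(bits, n):
    """Recover the gene and condition index lists from a bit string."""
    genes = [i for i in range(n) if bits[i] == 1]
    conds = [j for j in range(len(bits) - n) if bits[n + j] == 1]
    return genes, conds
```

With 2 genes and 3 conditions selected, `decode` returns index lists whose lengths multiply to the bicluster size 2 x 3 = 6, matching the example of Figure 1.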



Fitness function 

We aim to mine biclusters with a low mean squared residue, a high volume and a high gene-dimensional
variance; these three objectives, which are in conflict with each other, are used to model the
multi-objective optimization problem. In this paper, we use the following three fitness functions [26].



$$f_1(x) = \frac{|G| \times |C|}{size(x)} \qquad (8)$$

$$f_2(x) = \frac{MSR(x)}{\delta} \qquad (9)$$

$$f_3(x) = \frac{1}{RVAR(x)} \qquad (10)$$



Here |G| and |C| are the total numbers of genes and conditions of the microarray dataset, respectively.
size(x), MSR(x) and RVAR(x) denote the size, mean squared residue and row variance of the bicluster encoded
by frog x, respectively. δ is the user-defined threshold for the maximum acceptable mean squared residue.
Our algorithm minimizes these three fitness functions.
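
A minimal sketch of the three fitness functions follows, reading Eqs.(8)-(10) as f1 = |G||C| / size(x), f2 = MSR(x) / δ and f3 = 1 / RVAR(x). The function name `fitness` and its argument names are hypothetical, and the sketch assumes a non-empty bicluster with non-zero row variance.

```python
def fitness(size, msr, rvar, n_genes, n_conds, delta):
    """Return the objective vector (f1, f2, f3), all to be minimized.

    size    : number of cells of the bicluster, |g| * |c| (assumed > 0)
    msr     : mean squared residue of the bicluster
    rvar    : row variance of the bicluster (assumed > 0)
    n_genes : total number of genes in the dataset
    n_conds : total number of conditions in the dataset
    delta   : user-defined MSR threshold
    """
    f1 = (n_genes * n_conds) / size   # smaller for larger biclusters
    f2 = msr / delta                  # below 1 when MSR is under the threshold
    f3 = 1.0 / rvar                   # smaller for higher row variance
    return (f1, f2, f3)
```

Because all three values shrink as the bicluster gets larger, more homogeneous and more variable across rows, a single minimization direction can be used for Pareto comparisons.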

ε-dominance

Among the many MOEAs proposed, the non-dominated solutions of each generation are kept in an external
population that must be updated in each generation. The time needed for updating this external population
depends on its own size, the size of the main population and the number of objectives, and it grows sharply
as these three factors increase [36]. To encourage more exploration and to provide more diversity, relaxed
forms of Pareto dominance have become a popular mechanism to regulate the convergence of an MOEA. Among these
mechanisms, ε-dominance has become increasingly popular [16] because of its effectiveness and its sound
theoretical foundation. ε-dominance controls the granularity of the obtained approximation of the Pareto
front, which accelerates convergence and guarantees a good distribution of solutions. Here, we adapt the idea
of ε-dominance to bound the size of the population; this size depends on ε. We apply the ε-dominance
technique to search for the approximate Pareto front.

Definition 6 Dominance relation. Let $f, g \in \mathbb{R}^m$. Then $f$ is said to dominate $g$ (denoted as
$f \succ g$), iff

(i) $\forall i \in \{1, \ldots, m\}: f_i \le g_i$

(ii) $\exists j \in \{1, \ldots, m\}: f_j < g_j$

Definition 7 Pareto set. Let $F \subseteq \mathbb{R}^m$ be a set of vectors. Then the Pareto set $F^*$ of
$F$ is defined as follows: $F^*$ contains all vectors $g \in F$ which are not dominated by any vector
$f \in F$, i.e.

$$F^* := \{ g \in F \mid \nexists f \in F : f \succ g \} \qquad (11)$$



Vectors in F* are called Pareto vectors of F. The set of 
all Pareto sets of F is denoted as P*(F). 
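
Definitions 6 and 7 can be sketched in a few lines of Python. The sketch assumes minimization of every objective (consistent with the fitness functions being minimized) and uses the hypothetical helper names `dominates` and `pareto_set`; it is a quadratic-time illustration, not an efficient archive update.

```python
def dominates(f, g):
    """True if f dominates g: no worse in every objective, better in one."""
    no_worse = all(a <= b for a, b in zip(f, g))
    strictly_better = any(a < b for a, b in zip(f, g))
    return no_worse and strictly_better


def pareto_set(vectors):
    """Return the vectors that are not dominated by any other vector."""
    return [g for g in vectors
            if not any(dominates(f, g) for f in vectors)]
```

For the objective vectors (1, 5), (2, 2), (3, 3) and (5, 1), only (3, 3) is dominated (by (2, 2)), so the other three form the Pareto set.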

Definition 8 ε-dominance. Let $f, g \in \mathbb{R}^m$. Then $f$ is said to ε-dominate $g$ for some ε > 0,
denoted as $f \succ_\varepsilon g$, iff for all $i \in \{1, \ldots, m\}$



(12) 



Definition 9 ε-approximate Pareto set. Let $F \subseteq \mathbb{R}^m$ be a set of vectors and ε > 0. Then a
set $F_\varepsilon$ is called an ε-approximate Pareto set of $F$ if any vector $g \in F$ is ε-dominated by at
least one vector $f \in F_\varepsilon$, i.e.

$$\forall g \in F : \exists f \in F_\varepsilon \text{ such that } f \succ_\varepsilon g \qquad (13)$$

The set of all ε-approximate Pareto sets of $F$ is denoted as $P_\varepsilon(F)$.

Definition 10 ε-Pareto set. Let $F \subseteq \mathbb{R}^m$ be a set of vectors and ε > 0. Then a set
$F^*_\varepsilon \subseteq F$ is called an ε-Pareto set of $F$ if

(i) $F^*_\varepsilon$ is an ε-approximate Pareto set of $F$, i.e. $F^*_\varepsilon \in P_\varepsilon(F)$, and

(ii) $F^*_\varepsilon$ contains Pareto points of $F$ only, i.e. $F^*_\varepsilon \subseteq F^*$.

The set of all ε-Pareto sets of $F$ is denoted as $P^*_\varepsilon(F)$.

Update of ε-Pareto set of the frog population

In order to guarantee convergence and maintain diversity in the population at the same time, we
update the ε-Pareto set of the frog population during the selection operation [16].

Finding the global best solution

In order to find the global best solutions, we use the Sigma method [21] to find the best local guide $p_g$
among the population members for frog $i$ of the population, as follows. In the first step, we assign the
value $\sigma_j$ to each frog $j$ in the population. In the second step, $\sigma_i$ for frog $i$ of the
population is calculated. Then we calculate the distance between $\sigma_i$ and $\sigma_j$,
$\forall j = 1, \ldots, |A|$. Finally, the frog $k$ in the population $A$ whose $\sigma_k$ has the minimum
distance to $\sigma_i$ is selected as the best local guide for frog $i$. Therefore, frog $p_g = x_k$ is the
best local guide for frog $i$. In other words, each frog selects as its best local guide the population
member whose sigma value is closest to its own. In the case of a two-dimensional objective space, closer
refers to the difference between the sigma values; in the case of an m-dimensional objective space, it
refers to the m-dimensional Euclidean distance between the sigma values. The algorithm of the Sigma method
can find the best local guide $p_g$ for frog $i$ of the population [21]. Here, the function Sigma calculates
the σ value and dist computes the Euclidean distance. $y_j$ denotes the objective value of the jth element
of the population.
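
The guide selection described above can be sketched as follows. The source does not give the formula for σ, so the sketch assumes the usual form of the Sigma method: for two objectives σ = (f1² − f2²) / (f1² + f2²), and for more objectives one such term per pair of objectives, all divided by the sum of squared objectives. The function names `sigma` and `best_local_guide` are hypothetical, and objective vectors are assumed not to be all zero.

```python
import math
from itertools import combinations


def sigma(objectives):
    """Sigma vector of an objective vector (assumed pairwise form)."""
    squares = [v * v for v in objectives]
    total = sum(squares)  # assumed non-zero
    return [(squares[a] - squares[b]) / total
            for a, b in combinations(range(len(objectives)), 2)]


def best_local_guide(frog_objectives, member_objectives):
    """Index of the population member whose sigma is closest to the frog's."""
    s_i = sigma(frog_objectives)
    distances = [math.dist(s_i, sigma(y)) for y in member_objectives]
    return distances.index(min(distances))
```

Because σ depends only on the direction of the objective vector, a frog is paired with the member lying closest to its own line through the origin of the objective space, which tends to spread frogs across different regions of the front.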

Shuffled frog-leaping algorithm 

SFL is a population-based cooperative search metaphor combining the benefits of the genetic-based memetic
algorithm and the social behavior of particle swarm optimization. The shuffled frog-leaping algorithm is a
meta-heuristic proposed by Eusuff [31,32,34] for solving problems with discrete decision variables. In the
SFL algorithm, P randomly generated solutions form an initial population $X = \{x_1, x_2, \ldots, x_P\}$,
where each solution $x_i$, called a frog, is represented by a number of bits
$x_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$.

SFL starts with the whole population partitioned into a number of parallel subsets referred to as
memeplexes. Each memeplex is then considered a different culture of frogs and is permitted to evolve
independently to search the space. Within each memeplex, the individual frogs hold their own ideas, which
can be affected by the ideas of other frogs, and experience a memetic evolution. During the evolution, the
frogs may change their memes by using the information from the memeplex best x(b) or the best individual of
the entire population x(g). Incremental changes in memotypes correspond to a leaping step size, and the new
meme corresponds to the frog's new position. In each cycle, only the frog with the worst fitness x(w) in the
current memeplex is improved, by a process similar to PSO. The improving cycle has four steps: the first
step uses a method that is conceptually similar to the discrete particle swarm optimization algorithm, and
the second and third steps use the operators of the binary genetic algorithm (BGA), i.e. mutation and
crossover [34].

Step 1. For $d = 1, \ldots, N_{bit}$, use Eq.(14) to calculate the speed vector of the worst frog, $vw_i$.

$$vw_{id}^{n+1} = \chi \left[ \omega \cdot vw_{id}^{n} + c_1 \cdot r_1 \cdot (pb_{id}^{n} - xw_{id}^{n}) + k \cdot \mu_1 \cdot c_2 \cdot r_2 \cdot (gb_{d}^{n} - xw_{id}^{n}) + \mu_2 \cdot c_3 \cdot r_3 \cdot (xb_{id}^{n} - xw_{id}^{n}) \right] \qquad (14)$$

where $i$ denotes the worst frog of the ith memeplex, $n$ represents the iteration number, $pb_i$ is the
best position visited previously by the worst frog of the ith memeplex, $xb_i$ is the position of the best
frog in the ith memeplex, $gb$ is the global best position, and $\chi$ is the constriction factor; $c_1$,
$c_2$ and $c_3$ are three positive constants called acceleration coefficients ($c_1 = c_2 = c_3 = 2$);
$r_1$, $r_2$ and $r_3$ are three random numbers uniformly distributed between 0 and 1. $\mu_1$ and $\mu_2$
are called the influence factors: $\mu_1$ reflects the influence of the global best position on the worst
frog and $\mu_2$ reflects the influence of the best position of the memeplex on the worst frog. As a rule,
$\mu_1$ and $\mu_2$ are positive decimal fractions, with default values $\mu_1 = \mu_2 = 0.5$. $k$ reflects
the movement direction and is selected randomly: if $k = 1$ the frog moves towards the global best
position, otherwise $k = -1$ and it moves in the opposite direction. $\omega$ is called the inertia weight
in Eq.(14).

The position of the frog is determined using Eq.(15):

$$xw_{id}^{n+1} = boolean(xw_{id}^{n} + vw_{id}^{n+1}) \qquad (15)$$

where

$$boolean(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$


If this process produces a better solution, it replaces the worst frog; otherwise go to the next step.

Step 2. A mutation operator is applied to the position of the worst frog. If the fitness improves, the
resulting position is accepted; otherwise go to the next step.

Step 3. A crossover operator is applied between the worst frog of the memeplex and the globally best
position. The worst frog is replaced if its fitness is improved; otherwise go to the next step.

Step 4. The worst frog is replaced randomly: if no improvement was possible in the previous steps, x(w) is
replaced by a randomly generated solution within the entire feasible space.
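
The first of the four steps, the velocity and position update of Eqs.(14) and (15), can be sketched in Python as below. This is an illustrative reading rather than the authors' code: the function name `leap_worst_frog`, the default values of the inertia weight `omega` and the constriction factor `chi`, and the choice to draw the direction `k` once per leap are all assumptions; only c1 = c2 = c3 = 2 and μ1 = μ2 = 0.5 come from the text.

```python
import random


def leap_worst_frog(xw, vw, pb, gb, xb, omega=0.7, chi=1.0,
                    c=(2.0, 2.0, 2.0), mu=(0.5, 0.5), rng=random):
    """One Step 1 move of the worst frog; returns (new_bits, new_velocity).

    xw : current bit list of the worst frog
    vw : current real-valued velocity of the worst frog
    pb : best position previously visited by this frog
    gb : global best position
    xb : best position in the frog's memeplex
    omega, chi : hypothetical defaults, not given in the source
    """
    k = rng.choice((1, -1))  # movement direction (assumed: one draw per leap)
    new_x, new_v = [], []
    for d in range(len(xw)):
        r1, r2, r3 = rng.random(), rng.random(), rng.random()
        v = chi * (omega * vw[d]
                   + c[0] * r1 * (pb[d] - xw[d])
                   + k * mu[0] * c[1] * r2 * (gb[d] - xw[d])
                   + mu[1] * c[2] * r3 * (xb[d] - xw[d]))
        new_v.append(v)
        new_x.append(1 if xw[d] + v > 0 else 0)  # boolean() of Eq.(15)
    return new_x, new_v
```

In a full improvement cycle the caller would evaluate the returned position and keep it only if it is better than the current worst frog, falling back to mutation, crossover and random replacement otherwise.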

After a predefined number of memetic evolution steps (improvement cycles), the frogs in the memeplexes are
submitted to a shuffling process, in which all memeplexes are combined into a whole population and the
population is again divided into several new memeplexes. The memetic local search and the shuffling process
are repeated until a given termination condition is reached.
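
The partition and shuffle steps can be sketched as follows. The source does not state how frogs are assigned to memeplexes; the sketch assumes the usual SFLA convention of sorting frogs from best to worst and dealing them out in turn, so that frog 1 goes to memeplex 1, frog 2 to memeplex 2, and frog m+1 back to memeplex 1. The function names and the scalar sort key are assumptions, and in a multi-objective setting some ranking of the frogs would have to supply that key.

```python
def partition_into_memeplexes(ranked_frogs, m):
    """Deal frogs, already sorted best to worst, into m memeplexes in turn."""
    return [ranked_frogs[k::m] for k in range(m)]


def shuffle_memeplexes(memeplexes, key):
    """Merge all memeplexes into one population and re-rank it by key."""
    population = [frog for memeplex in memeplexes for frog in memeplex]
    population.sort(key=key)
    return population
```

With the parameter values reported in this paper (60 frogs and 6 memeplexes), this dealing scheme gives 6 memeplexes of 10 frogs each, and every memeplex receives frogs from the whole quality range.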

Accordingly, the main parameters of DSFL are: the number of frogs P, the number of memeplexes m, the number
of processing cycles on each memeplex before shuffling, the number of shuffling iterations (or function
evaluations), the number of bits for each variable, the mutation rate, the crossover type, the constriction
factor, the acceleration coefficients and the influence factors.

Based on some preliminary experimental results, suitable values were found as follows: the number of frogs
and the number of bits for each variable are 60 and 10, respectively, the number of processing cycles on
each memeplex before shuffling is 10, and the number of memeplexes is 6. The values of the other parameters
have been given above. This paper additionally incorporates a dynamic population size.

Dynamic population strategy 

Generally, multi-objective optimization focuses on two competing goals: (1) to converge quickly to the true
Pareto front and (2) to maintain the diversity of the solutions along the resulting Pareto front. Because
maintaining diversity slows down convergence and may degrade the quality of the resulting Pareto front,
these two goals are in conflict with each other. In this paper, we dynamically adjust the population size to
explore the search space while balancing these two competing goals.

Initializing the population 

The initial population of MODPSFLB is obtained by running a state-of-the-art MOEA (NSGA-II [37]) with 50
individuals for 20 generations.

Adding population size 

The population adding strategy mainly consists of increasing the population size to ensure a sufficient
number of individuals to contribute to the search process, and of placing those new individuals in
unexplored areas to discover new possible solutions. Based on the dynamic population size strategies of
[29], the procedure proposed in [38] is adopted to facilitate the exploration and exploitation capabilities
of MODPSFLB.

Decreasing population size 

To prevent excessive growth of the population, a population decreasing strategy [27] is used to adjust the
population size. The Sigma value is used to select potential frogs for deletion. After computing the
distance between the Sigma value of each frog and the Sigma value of its corresponding best local guide,
the frogs can be ranked by this distance. If frogs were removed based only on their distance rank, an
excessively large number of frogs might be eliminated, some of which may carry unique schemata that
contribute to the search process. A selection ratio is therefore implemented to regulate the number of
frogs to be removed and to preserve some degree of diversity at the same time. A selection ratio inspired
by Coello and Montes [39] is used to stochastically allocate a small percentage of frogs in the population
for removal.

MODPSFLB biclustering algorithm 

We incorporate a dynamic population strategy into the multi-objective shuffled frog-leaping biclustering
(MOSFLB) [38] algorithm and propose a multi-objective dynamic population shuffled frog-leaping biclustering
(MODPSFLB) algorithm to mine biclusters from microarray datasets and attain the global optimum solutions.
The proposed algorithm consists of the following four strategies: (1) ε-dominance to quicken convergence;
(2) the Sigma method to find good local guides; (3) a population-growing strategy to increase the population
size and promote exploration capability; and (4) a population-declining strategy to prevent the population
size from growing excessively. The pseudo-code of the proposed MODPSFLB algorithm is given in Algorithm 1.
Algorithm 1: MODPSFLB Algorithm
Input: microarray data, minimal MSR δ, a
Output: the best solutions, that is, the found biclusters
Begin
  Initialize the frog population A according to the population initializing strategy
  While not terminated do
    Calculate fitness for each frog
    Increase the size of population A according to the population adding strategy
    Divide the population into several memeplexes
    For each memeplex
      Determine the best and worst frogs
      Improve the worst frog position x(w) using Eq.(15)
      If no improvement in this case then
        x(w) is replaced by a randomly generated frog within the entire feasible space
      End if
    End for
    Combine the evolved memeplexes
    Select the best frogs using Sigma method and ε-dominance
    Decrease the size of population A according to the population decreasing strategy
  End while
  Return A, the set of biclusters
End

The MODPSFLB algorithm iteratively updates the frog population until the maximum number of generations is
reached, converging towards the optimal solution set.

Results 

Mitra and Banka applied an MOEA to the biclustering problem and proposed MOE Biclustering (MOEB) [17]. To
assess the diversity of the optimal solutions, we apply the proposed MODPSFLB algorithm to mine biclusters
from two well-known datasets and compare its diversity and convergence with those of the MOEB, MOPSOB [40]
and MOSFLB algorithms. The biological significance of the biclusters found by MODPSFLB is given at the end.

Datasets and data preprocessing 

The first dataset is the yeast Saccharomyces cerevisiae cell 
cycle expression data [41], and the second dataset is the 
human B-cells expression data [42]. 

The yeast dataset contains the expression levels of 2,884 genes under 17 conditions. All entries are
integers in the range 0-600. The yeast dataset has 34 missing values, which are replaced by random numbers
between 0 and 800 [5].

The human B-cells expression dataset is a collection of 4,026 genes and 96 conditions, with 12.3% missing
values; its entries are integers in the range -750 to 650. The missing values are replaced by random
numbers between -800 and 800 [5]. However, those random values affect the discovery of biclusters [43].
The parameter δ is set to 300 for the yeast data and to 1200 for the human B-cells expression data.

Experiments 

The MODPSFLB algorithm is implemented in the Java programming language and run on a 1.7 GHz Pentium 4 PC
with 512 MB of RAM running Windows XP. To evaluate its performance, the proposed algorithm is compared with
the MOEB, MOPSOB [40] and MOSFLB algorithms on two well-known datasets [41,42].

Yeast dataset 

Table 1 lists information on ten of the one hundred biclusters found on the yeast dataset. The one hundred
biclusters found by the proposed MODPSFLB algorithm cover 77.7% of the genes, 100% of the conditions and in
total 57.2% of the cells of the expression matrix. The biclusters found by the MOSFLB algorithm cover 76.7%
of the genes, 100% of the conditions and in total 54.3% of the cells of the expression matrix. The
biclusters found by MOPSOB [40] cover 73.1% of the genes, 100% of the conditions and in total 52.4% of the
cells of the expression matrix, while an average coverage of 51.34% of the cells is reported for MOEB [17].
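
Coverage figures of this kind can be computed as the fraction of genes, conditions and matrix cells that belong to at least one bicluster. The following sketch shows one way to do so; the function name `coverage` and the representation of a bicluster as a pair of index lists are assumptions made for the example, and the source does not describe how its percentages were obtained in detail.

```python
def coverage(biclusters, n_genes, n_conds):
    """Return (gene, condition, cell) coverage fractions.

    biclusters : iterable of (gene_indices, condition_indices) pairs
    n_genes    : total number of genes in the expression matrix
    n_conds    : total number of conditions in the expression matrix
    """
    genes, conds, cells = set(), set(), set()
    for g, c in biclusters:
        genes.update(g)
        conds.update(c)
        # Cells shared by overlapping biclusters are counted only once
        cells.update((i, j) for i in g for j in c)
    return (len(genes) / n_genes,
            len(conds) / n_conds,
            len(cells) / (n_genes * n_conds))
```

Counting cells as a set rather than summing bicluster sizes matters here, because the biclusters of a Pareto set usually overlap heavily.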

Figure 2 depicts sample gene expression profiles for a small bicluster (bicluster 63) from the yeast
dataset. It shows that 24 genes exhibit very similar behaviour under 17 conditions.

Human B-cells expression dataset 

Table 2 lists information on ten of the one hundred biclusters found on the human dataset. The one hundred
biclusters found by the proposed MODPSFLB algorithm cover 42.1% of the cells of the microarray dataset (53%
of the genes and 100% of the conditions). By comparison, the one hundred biclusters found by the MOSFLB
algorithm cover 40.8% of the cells (51.2% of the genes and 100% of the conditions). The one hundred
biclusters found by MOPSOB [40] on the human dataset cover 35.7% of the cells (46.7% of the genes and 100%
of the conditions), whereas an average of 20.96% of the cells is covered by MOEB [17].



Table 1 Information of biclusters found on yeast dataset

Bicluster  Genes  Conditions  Residue  Row variance
1          101    15          215.62   749.17
6          514    10          289.65   955.25
11         858    10          322.58   702.36
22         478    11          298.68   885.64
31         123    12          201.88   699.87
36         801    8           221.88   687.18
44         1125   13          236.47   598.68
56         847    11          208.48   748.54
75         546    9           250.14   664.13
89         89     17          210.88   666.57

Table 1 shows the number of genes and conditions, the mean squared residue and the row variance of ten biclusters out of the one hundred biclusters found
on the yeast dataset.
Figure 2 Small biclusters of size 24 x 17 on the yeast dataset. Figure 2 shows the expression values of 24 genes under 17 conditions from 
the small bicluster (bicluster 63). 




Comparative analysis 

We compare the proposed MODPSFLB algorithm with 
the MOPSOB, MOSFLB and DMOPSOB algorithms on the 
yeast dataset and the human dataset; the results are 
shown in Table 3. 

According to Table 3, the biclusters found by MODPSFLB have 
a slightly lower average mean squared residue and a larger 
average bicluster size than those found by the other three 
algorithms on both the yeast dataset and the human dataset. 
It is clear from the above results that the proposed MODPSFLB 
algorithm performs best in maintaining the diversity of solutions. 

As for the computation cost, Table 3 shows that 
MODPSFLB has the lowest computation time, 88.24 s 
on the yeast dataset and 287.98 s on the human dataset, 
which is superior to that of the other three algorithms. 
Table 3 also shows that the algorithms adopting the dynamic 
population strategy have a lower computation cost than 
those that do not adopt it. 



Table 2 Biclusters found on human dataset 

Bicluster   Genes   Conditions   Residue   Row variance 
1           882     34           987.54    3587.26 
4           666     54           1087.25   4201.36 
11          1024    36           773.69    2930.64 
17          1102    39           1204.65   3698.84 
24          968     37           1110.25   3548.45 
35          805     41           844.44    2987.01 
39          871     48           2874.17   2140.36 
44          1208    29           885.74    3587.45 
59          258     86           777.58    2874.94 
88          1508    59           1405      6658.45 

Table 2 shows the number of genes and conditions, the mean squared residue and the row variance of ten biclusters out of the one hundred biclusters found 
on the human dataset. 
Table 3 Comparative study of four algorithms 

                  MOPSOB                MOSFLB                DMOPSOB               MODPSFLB 
Dataset           Yeast      Human      Yeast      Human      Yeast      Human      Yeast      Human 
Avg. MSR          218.54     927.47     215.98     913.53     216.13     905.23     212.8      904.9 
Avg. size         10510.8    34012.24   1109.23    35507.22   11213.5    35442.98   11220.7    35601.8 
Avg. genes        1102.84    902.41     1148.21    928.12     1151.25    932.57     1154.21    933.9 
Avg. conditions   9.31       40.12      9.78       43.11      9.59       42.78      9.81       43.29 
Max size          15613      37666      15709      37871      14770      37231      14827      37486 
Avg. time (s)     120.78     328.56     111.41     319.88     100.47     310.34     88.24      287.98 


Table 3 compares the performance of the four algorithms. It gives the average mean squared residue and the average size of the biclusters found, together with 
the computation cost of each algorithm. 



This shows that the dynamic population strategy can speed up 
the optimization process. 

In summary, it is clear from the above results that the 
proposed MODPSFLB algorithm performs best in maintaining 
diversity and achieving convergence. 

Biological analysis of biclusters 

We determine the biological relevance of the biclusters 
found by MODPSFLB on the yeast dataset in terms of 
statistically significant GO annotations. The 
Gene Ontology (GO) project (http://www.geneontology.org) 
provides three structured, controlled vocabularies 
that describe gene products in terms of their associated 
biological processes, cellular components and molecular 
functions in a species-independent manner. To better 
understand the mining results, we feed the genes in each 
bicluster to Onto-Express (http://vortex.cs.wayne.edu/Projects.html) 
and obtain a hierarchy of functional annotations 
in terms of Gene Ontology for each bicluster. 

The degree of enrichment is measured by p-values, 
which use a cumulative hypergeometric distribution to 
compute the probability of observing the number of 
genes from a particular GO category (function, process 
or component) within each bicluster. For example, the 
probability p of finding at least k genes from a particular 
category within a bicluster of size n is given in Eq. (16). 



$$ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{m}{i}\binom{g-m}{n-i}}{\binom{g}{n}} \qquad (16) $$



where m is the total number of genes within a category 
and g is the total number of genes within the genome. 
The p-values are calculated for each functional 
category in each bicluster to denote how well those 
genes match the corresponding GO category. 
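
The hypergeometric tail probability described above can be evaluated exactly with integer binomial coefficients. The following is a minimal sketch using only the Python standard library; the function name `enrichment_p_value` and the toy numbers are illustrative assumptions and do not reproduce the values reported in the paper, since the category sizes and genome size used there are not given here.

```python
from math import comb


def enrichment_p_value(k, n, m, g):
    """Probability of finding at least k genes of a category in a bicluster.

    k: number of bicluster genes observed in the category
    n: number of genes in the bicluster
    m: total number of genes in the category
    g: total number of genes in the genome
    """
    total = comb(g, n)
    # Number of ways to draw n genes with fewer than k from the category.
    below = sum(comb(m, i) * comb(g - m, n - i) for i in range(k))
    # Subtract as exact integers first to avoid cancellation for tiny p-values.
    return (total - below) / total


# Toy example: genome of 10 genes, category of 4 genes, bicluster of 3 genes.
p_at_least_one = enrichment_p_value(k=1, n=3, m=4, g=10)
p_all_three = enrichment_p_value(k=3, n=3, m=4, g=10)
```

Computing `total - below` before dividing is a design choice for numerical stability: it is algebraically the same as one minus the cumulative sum, but keeps very small p-values from being lost to floating-point rounding.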

Table 4 lists the significant shared GO terms (or parents 
of GO terms) used to describe the set of genes in 
each bicluster for the process, function and component 
ontologies. Only the most significant common terms are 
shown. For example, for cluster C1, we find that the 
genes are mainly involved in oxidoreductase activity. 
The tuple (n = 13, p = 0.00051) means that out of 101 
genes in cluster C1, 13 genes belong to the oxidoreductase 
activity function, and the statistical significance is given 
by the p-value of 0.00051. These results indicate that the 
proposed MODPSFLB biclustering approach can find 
biologically meaningful clusters. 

Conclusions 

This paper proposes a novel multi-objective dynamic 
population shuffled frog-leaping biclustering framework 
for mining biclusters from microarray datasets. We 
focus on finding maximum biclusters with lower mean 
squared residue and higher row variance. These three 
objectives (bicluster size, mean squared residue and row 
variance) are incorporated into the framework as 
three fitness functions. We apply the following techniques: 
an SFL method to balance and control the search 
process, a population-adding method that dynamically 
grows new individuals with enhanced exploration and 
exploitation capabilities, and a population-decreasing strategy 
that balances and controls the dynamic population size and, 
finally, quickens convergence of the algorithm. 



Table 4 Significant GO terms of genes in three biclusters 

Cluster No.   No. of genes   Process                   Function                             Component 
1             101            Lipid transport           Oxidoreductase activity              Membrane 
                             (n = 21, p = 0.00389)     (n = 13, p = 0.00051)                (n = 12, p = 0.0023) 
12            71             Physiological process     MAP kinase activity                  Cell 
                             (n = 43, p = 0.0043)      (n = 7, p = 0.00126)                 (n = 32, p = 0.00194) 
33            58             Protein biosynthesis      Structural constituent of ribosome   Cytosolic ribosome 
                             (n = 27, p = 0.00216)     (n = 17, p = 0.00132)                (n = 11, p = 0.00219) 

Table 4 lists the significant shared GO terms which are used to describe genes in each bicluster for the process, function and component ontology. 
The comparative study of MODPSFLB and three 
state-of-the-art biclustering algorithms on the yeast 
microarray dataset and the human B-cells expression 
dataset clearly verifies that MODPSFLB can effectively 
find significant localized structures related to sets of 
genes that show consistent expression patterns across 
subsets of experimental conditions. The mined patterns 
show significant biological relevance in terms of 
related biological processes, components and molecular 
functions in a species-independent manner. 